Aggregation of Imprecise and Uncertain Information for Knowledge Discovery in Databases
Authors
Abstract
We consider the problem of aggregation for uncertain and imprecise data. For such data, we define aggregation operators and use them to provide information on properties and patterns of data attributes. The aggregates that we define use the Kullback-Leibler information divergence between the aggregated probability distribution and the individual tuple data values. We are thus able to provide a probability distribution for the domain values of an attribute or group of attributes using imperfect data. Information stored in a database is often subject to uncertainty and imprecision. An extended relational data model has previously been proposed for such data, which allows us to quantify our uncertainty and imprecision about attribute values by representing them as a probability distribution. Our aggregation operators are defined on such a data model. The provision of such operators is a central requirement in furnishing a database with the capability to perform the operations necessary for Knowledge Discovery in Databases.

From: KDD-98 Proceedings. Copyright © 1998, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Background

Real-life data are frequently uncertain, i.e. we are not certain about the truth of an attribute value, or imprecise, i.e. we are not certain about the specific value of an attribute. Such imprecision might occur naturally as a result of data being provided at different levels of the concept hierarchy, or from the integration of distributed databases. It is therefore important that generalised database systems provide intelligent ways of handling such imperfect information. A recent survey of various approaches to handling imperfect information in Data and Knowledge Bases has been provided by Parsons (1996).

A database model based on partial values and partial probabilities (DeMichiel, 1989; Tseng et al., 1993; Chen and Tseng, 1996) has been proposed to handle both imprecise and uncertain data. Partial values may be thought of as a generalisation of null values: rather than not knowing anything about a particular attribute value, we may identify the attribute value as belonging to a set of possible values. Partial probabilistic values (Chang and Chen, 1994) generalise further by assigning probabilities to the partial values. We may thus combine the imprecision-handling capacity of partial values with the uncertainty-handling capabilities of probability theory. This capacity to handle both imprecise and uncertain data is a major strength of our current approach. The facility to aggregate such data is a central requirement in providing a database with the capability to perform the operations necessary for Knowledge Discovery in Databases, where we are frequently concerned with identifying interesting attributes or interesting relationships between attributes.

Data Models for Imprecise and Uncertain Data

In a generalised relational database we consider an attribute A and a tuple t of a relation R whose values may be imprecise, in that the value of an attribute in any particular tuple may be a partial value.

Definition 1. A partial value is determined by a set of possible attribute values of tuple t of attribute A, of which one and only one is the true value. We denote a partial value $\eta$ by $\eta = [a_r, \ldots, a_s]$, corresponding to a set of h possible values $\{a_r, \ldots, a_s\}$ of the same domain, in which exactly one of these values is the true value of $\eta$. Here, h is the cardinality of $\eta$; $\{a_r, \ldots, a_s\}$ is a subset of the domain set $\{a_1, \ldots, a_k\}$ of attribute A of relation R, and $h \le k$.

We extend the definition to a partial probability distribution, where probabilities are assigned to each partial value.

Definition 2.
A partial probability distribution is a vector of probabilities $\phi(\eta) = (p_1, \ldots, p_e)$ associated with a partition of partial values $\eta = (\eta_1, \ldots, \eta_e)$ of tuple t of attribute A, such that $p_i$ = probability(the attribute value is a member of partial value set $\eta_i$). Here $\sum_{i=1}^{e} p_i = 1$.

Example 1: For a partial probability distribution to be defined on the attribute SMOKING_STATUS, we have a set of partial values which forms a partition of the domain, such as ({‘heavy’, ‘light’}, {‘ex-heavy’, ‘ex-light’}, {‘never’}). We then define a partial probability distribution on this partition, e.g. <{‘heavy’, ‘light’}, 0.4>, <{‘ex-heavy’, ‘ex-light’}, 0.35>, <{‘never’}, 0.25>. This distribution means, for example, that the probability of being either a heavy or a light smoker is 0.4.

An extended relational model in which the tuples consist of partial values was described in DeMichiel (1989) and Chen and Tseng (1996). We extend this data model to an extended relational database model based on a partial probability distribution. An equivalent data model, along with some extended relational operators such as project and join, has been proposed by Barbará et al. (1992), who introduced the term probability data model (PDM).
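As a minimal sketch of Definition 2, the following Python checks that a candidate distribution is defined on a partition of the domain and that its probabilities sum to 1, using the values from Example 1. The function name `is_partial_probability_distribution` and the list-of-pairs representation are assumptions made for illustration; they do not come from the paper.

```python
from math import isclose


def is_partial_probability_distribution(dist, domain):
    """Check Definition 2 for `dist`, a list of (partial_value, probability) pairs.

    Hypothetical helper: the representation is an illustrative assumption.
    """
    cells = [frozenset(values) for values, _ in dist]
    probs = [p for _, p in dist]
    covered = frozenset().union(*cells)
    is_partition = (
        all(cells)                                      # no empty partial value
        and covered == frozenset(domain)                # cells cover the domain
        and sum(len(c) for c in cells) == len(covered)  # cells are disjoint
    )
    valid_probs = all(0.0 <= p <= 1.0 for p in probs) and isclose(sum(probs), 1.0)
    return is_partition and valid_probs


# Values taken from Example 1 (SMOKING_STATUS).
SMOKING_DOMAIN = {"heavy", "light", "ex-heavy", "ex-light", "never"}
example_1 = [
    ({"heavy", "light"}, 0.4),
    ({"ex-heavy", "ex-light"}, 0.35),
    ({"never"}, 0.25),
]

print(is_partial_probability_distribution(example_1, SMOKING_DOMAIN))
```

Storing each partial value as a set keeps the imprecision (which values are possible) separate from the uncertainty (the probability attached to each set), mirroring the two components of the definition.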
ID | SEX | SMOKING_STATUS | HYPERTENSION | HEART_DISEASE
01 | <{M},0.0> <{F},1.0> | <{S},1.0> <{E},0.0> <{N},0.0> | <{S,M},1.0> <{N},0.0> | <{S},0.75> <{M,N},0.25>
02 | <{M},1.0> <{F},0.0> | <{S},0.0> <{E},1.0> <{N},0.0> | <{S,M,N},1.0> | <{S},0.75> <{M},0.2> <{N},0.05>
03 | <{M},0.0> <{F},1.0> | <{S},0.0> <{E},0.2> <{N},0.8> | <{S,M},0.0> <{N},1.0> | <{S,M},0.0> <{N},1.0>
04 | <{M},1.0> <{F},0.0> | <{S},0.9> <{E},0.1> <{N},0.0> | <{S},1.0> <{M,N},0.0> | <{S},1.0> <{M,N},0.0>
05 | <{M},0.0> <{F},1.0> | <{S},0.0> <{E},1.0> <{N},0.0> | <{S,M},0.0> <{N},1.0> | <{S},0.2> <{M,N},0.8>
06 | <{M},1.0> <{F},0.0> | <{S},1.0> <{E},0.0> <{N},0.0> | <{S,M},1.0> <{N},0.0> | <{S},1.0> <{M,N},0.0>
07 | <{M},1.0> <{F},0.0> | <{S},0.0> <{E},0.0> <{N},1.0> | <{S,M},0.0> <{N},1.0> | <{S},0.1> <{M},0.2> <{N},0.7>
08 | <{M},1.0> <{F},0.0> | <{S},1.0> <{E},0.0> <{N},0.0> | <{S},1.0> <{M,N},0.0> | <{S},0.9> <{M},0.05> <{N},0.05>
09 | <{M},0.0> <{F},1.0> | <{S},0.75> <{E},0.25> <{N},0.0> | <{S,M},1.0> <{N},0.0> | <{S},0.0> <{M},1.0> <{N},0.0>
10 | <{M},1.0> <{F},0.0> | <{S},1.0> <{E},0.0> <{N},0.0> | <{S,M},0.0> <{N},1.0> | <{S},0.2> <{M,N},0.8>

Table 1. A Probability Data Table

Table Legend. SEX: M male, F female; SMOKING_STATUS: S smoker, E ex-smoker, N never; HYPERTENSION: S severe, M mild, N no hypertension; HEART_DISEASE: S severe, M mild, N none.

Definition 3. A probability data model (or table) R is a relation based on the probability distribution of partial values for domains $D_1, D_2, \ldots, D_n$ of attributes $A_1, A_2, \ldots, A_n$, where $R \subseteq P_1 \times P_2 \times \cdots \times P_n$ and $P_i$ is the set of all the probability distributions on the power set of domain $D_i$. Each element of the table is a probability distribution specified on a partition of the appropriate domain into partial values.

An example of a probability data model is presented in Table 1. This general relational data model allows us to represent a number of other models as special cases. For example, the attribute SEX is crisp with no null values, i.e. a value is M with probability 1 or F with probability 1.
The attribute SMOKING_STATUS consists entirely of crisp probabilistic values, i.e. in each case we know probabilities for each of the three possible values smoker, ex-smoker or never smoked. The attribute HYPERTENSION consists of true partial values: here attribute values are imprecise but not uncertain. The final attribute, HEART_DISEASE, is a true partial probability distribution; the values are therefore both imprecise and uncertain.

Aggregation of Partial Probabilities

In this section we develop an approach which allows us to aggregate attributes of a partial probability distribution relation such as that presented in Table 1. This general aggregation operator (gagg) must therefore include, as special cases, crisp and certain data (e.g. SEX in Table 1) and crisp and uncertain data (e.g. SMOKING_STATUS).

Notation: As before, we consider an attribute $A_j$ of a partial probability relation R with corresponding domain $D_j = \{v_1, \ldots, v_k\}$, which has tuples $t_1, \ldots, t_m$. The value of the rth tuple of $A_j$ is a probability distribution described by the vectors $t_r.A_j = (f_{r1}^{(j)}, \ldots, f_{r g_r}^{(j)})$ and $P = (S_{r1}^{(j)}, \ldots, S_{r g_r}^{(j)})$, where P is a partition of the domain $D_j$ and $f_{r\ell}^{(j)}$ = Prob(tuple value is a member of $S_{r\ell}^{(j)}$); $g_r$ is the number of sets in the partition for tuple $t_r$. We further define:

$q_{ir\ell}^{(j)} = \begin{cases} 1 & \text{if } v_i \in S_{r\ell}^{(j)} \\ 0 & \text{otherwise.} \end{cases}$

Definition 4: The general aggregate of a number of probability distribution values on attribute $A_j$ of relation R, denoted $gagg(R.A_j)$, is defined as a vector-valued function $gagg(R.A_j) = (\pi_1, \ldots, \pi_k)$ in which the $\pi_i$'s are computed from the iterative scheme:
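The indicator q from the notation above (not the iterative scheme itself, which is not reproduced here) can be illustrated with a short Python sketch. It builds the q values for the HEART_DISEASE entry of tuple 01 in Table 1. The function name `indicator_q` and the nested-list layout are illustrative assumptions, not part of the paper.

```python
def indicator_q(domain, partition):
    """Return q as a nested list: q[i][l] is 1 if domain[i] lies in partition[l], else 0.

    Hypothetical helper mirroring the definition of q for a single tuple r.
    """
    return [[1 if value in cell else 0 for cell in partition] for value in domain]


# HEART_DISEASE for tuple 01 in Table 1: <{S},0.75> <{M,N},0.25>
HEART_DISEASE_DOMAIN = ["S", "M", "N"]   # v_1, v_2, v_3
partition_01 = [{"S"}, {"M", "N"}]       # S_{r1}, S_{r2} for r = 1
f_01 = [0.75, 0.25]                      # f_{r1}, f_{r2} for r = 1

q_01 = indicator_q(HEART_DISEASE_DOMAIN, partition_01)
for value, row in zip(HEART_DISEASE_DOMAIN, q_01):
    print(value, row)
```

Because the sets of a tuple form a partition of the domain, each domain value belongs to exactly one set, so every row of q contains a single 1.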
Publication date: 1998